Predictive Analytics for Solar Power Generation: Developing Machine Learning Models Using Meteorological Data¶
AI 221 project by:
Daniel De Castro (ddecastro2@up.edu.ph)
University of the Philippines Diliman
Problem Statement¶
The fluctuating nature of solar power generation, primarily due to varying environmental conditions, presents significant challenges in energy management and grid integration. Existing prediction models often do not fully account for the complex interactions of meteorological factors, leading to inaccuracies in solar power output forecasts. This unpredictability impacts the efficiency of energy distribution and the reliability of solar power as a consistent energy source. An improved predictive model that comprehensively considers various environmental influences is crucial for enhancing the predictability and utility of solar energy.
Project Objective¶
The goal of this project is to harness the power of machine learning to develop robust models that accurately predict solar power generation from meteorological data. The objectives are as follows:
Data Analysis and Feature Engineering: Conduct an in-depth analysis of the 'Solar Energy Power Generation' dataset. Identify key environmental factors and engineer features that effectively capture the dynamics impacting solar power output.
Model Development: Explore and implement a range of machine learning algorithms, including but not limited to, linear regression, ensemble methods, and neural networks, to assess their suitability for this prediction task.
Model Training and Validation: Utilize cross-validation strategies to train and fine-tune the models. Ensure that they not only fit the training data well but also generalize effectively to new, unseen data.
Performance Evaluation: Evaluate the performance of each model using standard regression metrics like RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), and R² (Coefficient of Determination). The best model will be selected based on its performance on these metrics.
Insights and Recommendations: Provide insights into the impact of various meteorological factors on solar power generation. Based on the findings, offer practical recommendations for optimizing solar energy usage and future research directions.
By achieving these objectives, this project aims to significantly improve the predictability and efficiency of solar energy, thereby bolstering its viability as a sustainable energy resource.
import pandas as pd
RANDOM_SEED = 43 # for reproducibility
Dataset Overview¶
The Kaggle Solar Energy Power Generation dataset provides comprehensive data on solar power generation along with various meteorological factors that can impact the generation of solar energy. It is a valuable resource for analyzing and predicting solar power output based on environmental conditions.
dataset = pd.read_csv("spg.csv")
dataset.head()
| temperature_2_m_above_gnd | relative_humidity_2_m_above_gnd | mean_sea_level_pressure_MSL | total_precipitation_sfc | snowfall_amount_sfc | total_cloud_cover_sfc | high_cloud_cover_high_cld_lay | medium_cloud_cover_mid_cld_lay | low_cloud_cover_low_cld_lay | shortwave_radiation_backwards_sfc | ... | wind_direction_10_m_above_gnd | wind_speed_80_m_above_gnd | wind_direction_80_m_above_gnd | wind_speed_900_mb | wind_direction_900_mb | wind_gust_10_m_above_gnd | angle_of_incidence | zenith | azimuth | generated_power_kw | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.17 | 31 | 1035.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.00 | ... | 312.71 | 9.36 | 22.62 | 6.62 | 337.62 | 24.48 | 58.753108 | 83.237322 | 128.33543 | 454.10095 |
| 1 | 2.31 | 27 | 1035.1 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 1.78 | ... | 294.78 | 5.99 | 32.74 | 4.61 | 321.34 | 21.96 | 45.408585 | 75.143041 | 139.65530 | 1411.99940 |
| 2 | 3.65 | 33 | 1035.4 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 108.58 | ... | 270.00 | 3.89 | 56.31 | 3.76 | 286.70 | 14.04 | 32.848282 | 68.820648 | 152.53769 | 2214.84930 |
| 3 | 5.82 | 30 | 1035.4 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 258.10 | ... | 323.13 | 3.55 | 23.96 | 3.08 | 339.44 | 19.80 | 22.699288 | 64.883536 | 166.90159 | 2527.60920 |
| 4 | 7.73 | 27 | 1034.4 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 375.58 | ... | 10.01 | 6.76 | 25.20 | 6.62 | 22.38 | 16.56 | 19.199908 | 63.795208 | 182.13526 | 2640.20340 |
5 rows × 21 columns
The dataset contains 21 columns, each representing a different environmental or solar-energy-related measurement. Summary statistics for each column are shown below, followed by a description of each column.
dataset.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| temperature_2_m_above_gnd | 4213.0 | 15.068111 | 8.853677 | -5.350000 | 8.390000 | 14.750000 | 21.290000 | 34.90000 |
| relative_humidity_2_m_above_gnd | 4213.0 | 51.361025 | 23.525864 | 7.000000 | 32.000000 | 48.000000 | 70.000000 | 100.00000 |
| mean_sea_level_pressure_MSL | 4213.0 | 1019.337812 | 7.022867 | 997.500000 | 1014.500000 | 1018.100000 | 1023.600000 | 1046.80000 |
| total_precipitation_sfc | 4213.0 | 0.031759 | 0.170212 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.20000 |
| snowfall_amount_sfc | 4213.0 | 0.002808 | 0.038015 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.68000 |
| total_cloud_cover_sfc | 4213.0 | 34.056990 | 42.843638 | 0.000000 | 0.000000 | 8.700000 | 100.000000 | 100.00000 |
| high_cloud_cover_high_cld_lay | 4213.0 | 14.458818 | 30.711707 | 0.000000 | 0.000000 | 0.000000 | 9.000000 | 100.00000 |
| medium_cloud_cover_mid_cld_lay | 4213.0 | 20.023499 | 36.387948 | 0.000000 | 0.000000 | 0.000000 | 10.000000 | 100.00000 |
| low_cloud_cover_low_cld_lay | 4213.0 | 21.373368 | 38.013885 | 0.000000 | 0.000000 | 0.000000 | 10.000000 | 100.00000 |
| shortwave_radiation_backwards_sfc | 4213.0 | 387.759036 | 278.459293 | 0.000000 | 142.400000 | 381.810000 | 599.860000 | 952.30000 |
| wind_speed_10_m_above_gnd | 4213.0 | 16.228787 | 9.876948 | 0.000000 | 9.010000 | 14.460000 | 21.840000 | 61.18000 |
| wind_direction_10_m_above_gnd | 4213.0 | 195.078452 | 106.626782 | 0.540000 | 153.190000 | 191.770000 | 292.070000 | 360.00000 |
| wind_speed_80_m_above_gnd | 4213.0 | 18.978483 | 11.999960 | 0.000000 | 10.140000 | 16.240000 | 26.140000 | 66.88000 |
| wind_direction_80_m_above_gnd | 4213.0 | 191.166862 | 108.760021 | 1.120000 | 130.240000 | 187.770000 | 292.040000 | 360.00000 |
| wind_speed_900_mb | 4213.0 | 16.363190 | 9.885330 | 0.000000 | 9.180000 | 14.490000 | 21.970000 | 61.11000 |
| wind_direction_900_mb | 4213.0 | 192.447911 | 106.516195 | 1.120000 | 148.220000 | 187.990000 | 288.000000 | 360.00000 |
| wind_gust_10_m_above_gnd | 4213.0 | 20.583489 | 12.648899 | 0.720000 | 11.160000 | 18.000000 | 27.000000 | 84.96000 |
| angle_of_incidence | 4213.0 | 50.837490 | 26.638965 | 3.755323 | 29.408181 | 47.335557 | 69.197492 | 121.63592 |
| zenith | 4213.0 | 59.980947 | 19.857711 | 17.727761 | 45.291631 | 62.142611 | 74.346737 | 128.41537 |
| azimuth | 4213.0 | 169.167651 | 64.568385 | 54.379093 | 114.136600 | 163.241650 | 225.085620 | 289.04518 |
| generated_power_kw | 4213.0 | 1134.347313 | 937.957247 | 0.000595 | 231.700450 | 971.642650 | 2020.966700 | 3056.79410 |
- `temperature_2_m_above_gnd`: Temperature measured 2 meters above the ground. It ranges from -5.35°C to 34.9°C with an average of 15.07°C.
- `relative_humidity_2_m_above_gnd`: Relative humidity measured 2 meters above the ground, ranging from 7% to 100%, with an average of 51.36%.
- `mean_sea_level_pressure_MSL`: Mean sea level pressure in millibars, varying from 997.5 to 1046.8, with an average of 1019.34.
- `total_precipitation_sfc`: Total precipitation at the surface level, ranging from 0 to 3.2 mm, with an average of 0.03 mm.
- `snowfall_amount_sfc`: Snowfall amount at the surface level, ranging from 0 to 1.68 mm, with an average of 0.003 mm.
- `total_cloud_cover_sfc`: Total cloud cover percentage at the surface level, ranging from 0% to 100%, with an average of 34.06%.
- `high_cloud_cover_high_cld_lay`: High cloud cover percentage in the high cloud layer, ranging from 0% to 100%, with an average of 14.46%.
- `medium_cloud_cover_mid_cld_lay`: Medium cloud cover percentage in the mid cloud layer, ranging from 0% to 100%, with an average of 20.02%.
- `low_cloud_cover_low_cld_lay`: Low cloud cover percentage in the low cloud layer, ranging from 0% to 100%, with an average of 21.37%.
- `shortwave_radiation_backwards_sfc`: Shortwave radiation backwards at the surface level in watts per square meter, ranging from 0 to 952.3, with an average of 387.76.
- `wind_speed_10_m_above_gnd`: Wind speed measured 10 meters above the ground in km/h, ranging from 0 to 61.18, with an average of 16.23.
- `wind_direction_10_m_above_gnd`: Wind direction measured 10 meters above the ground in degrees, ranging from 0° to 360°, with an average of 195.08°.
- `wind_speed_80_m_above_gnd`: Wind speed measured 80 meters above the ground in km/h, ranging from 0 to 66.88, with an average of 18.98.
- `wind_direction_80_m_above_gnd`: Wind direction measured 80 meters above the ground in degrees, ranging from 1.12° to 360°, with an average of 191.17°.
- `wind_speed_900_mb`: Wind speed at the 900 millibar pressure level in km/h, ranging from 0 to 61.11, with an average of 16.36.
- `wind_direction_900_mb`: Wind direction at the 900 millibar pressure level in degrees, ranging from 1.12° to 360°, with an average of 192.45°.
- `wind_gust_10_m_above_gnd`: Wind gust speed measured 10 meters above the ground in km/h, ranging from 0.72 to 84.96, with an average of 20.58.
- `angle_of_incidence`: The angle of incidence in degrees, ranging from 3.76° to 121.64°, with an average of 50.84°.
- `zenith`: The zenith angle in degrees, ranging from 17.73° to 128.42°, with an average of 59.98°.
- `azimuth`: The azimuth angle in degrees, ranging from 54.38° to 289.05°, with an average of 169.17°.
- `generated_power_kw`: The generated power in kilowatts, ranging from approximately 0 kW to 3056.79 kW, with an average of 1134.35 kW.
Data Preprocessing¶
# Handling missing values, if any
data = dataset.dropna()
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4213 entries, 0 to 4212
Data columns (total 21 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   temperature_2_m_above_gnd          4213 non-null   float64
 1   relative_humidity_2_m_above_gnd    4213 non-null   int64
 2   mean_sea_level_pressure_MSL        4213 non-null   float64
 3   total_precipitation_sfc            4213 non-null   float64
 4   snowfall_amount_sfc                4213 non-null   float64
 5   total_cloud_cover_sfc              4213 non-null   float64
 6   high_cloud_cover_high_cld_lay      4213 non-null   int64
 7   medium_cloud_cover_mid_cld_lay     4213 non-null   int64
 8   low_cloud_cover_low_cld_lay        4213 non-null   int64
 9   shortwave_radiation_backwards_sfc  4213 non-null   float64
 10  wind_speed_10_m_above_gnd          4213 non-null   float64
 11  wind_direction_10_m_above_gnd      4213 non-null   float64
 12  wind_speed_80_m_above_gnd          4213 non-null   float64
 13  wind_direction_80_m_above_gnd      4213 non-null   float64
 14  wind_speed_900_mb                  4213 non-null   float64
 15  wind_direction_900_mb              4213 non-null   float64
 16  wind_gust_10_m_above_gnd           4213 non-null   float64
 17  angle_of_incidence                 4213 non-null   float64
 18  zenith                             4213 non-null   float64
 19  azimuth                            4213 non-null   float64
 20  generated_power_kw                 4213 non-null   float64
dtypes: float64(17), int64(4)
memory usage: 691.3 KB
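Since all 4,213 rows are complete, `dropna()` removes nothing here; a quick missing-value check makes that explicit. A minimal sketch of the same pattern on a hypothetical toy frame (not the solar dataset itself):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the solar dataset
demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

missing_per_column = demo.isna().sum()  # NaN count per column
cleaned = demo.dropna()                 # drop any row containing a NaN
```

On the real data, `dataset.isna().sum()` returns zeros for every column, so `data` and `dataset` are identical.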
Exploratory Data Analysis (EDA)¶
import matplotlib.pyplot as plt
import seaborn as sns
Conduct a correlation analysis to identify pairs of variables that have a strong linear relationship. High correlation (both positive and negative) can indicate interesting pairs to visualize.
correlation_matrix = data.corr()
correlation_matrix
| temperature_2_m_above_gnd | relative_humidity_2_m_above_gnd | mean_sea_level_pressure_MSL | total_precipitation_sfc | snowfall_amount_sfc | total_cloud_cover_sfc | high_cloud_cover_high_cld_lay | medium_cloud_cover_mid_cld_lay | low_cloud_cover_low_cld_lay | shortwave_radiation_backwards_sfc | ... | wind_direction_10_m_above_gnd | wind_speed_80_m_above_gnd | wind_direction_80_m_above_gnd | wind_speed_900_mb | wind_direction_900_mb | wind_gust_10_m_above_gnd | angle_of_incidence | zenith | azimuth | generated_power_kw | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| temperature_2_m_above_gnd | 1.000000 | -0.771704 | -0.402240 | -0.083137 | -0.121422 | -0.326641 | -0.019522 | -0.100980 | -0.381876 | 0.665755 | ... | 0.051393 | -0.244869 | 0.086630 | -0.198107 | 0.043233 | -0.188264 | -0.090173 | -0.545646 | 0.381797 | 0.217280 |
| relative_humidity_2_m_above_gnd | -0.771704 | 1.000000 | 0.100529 | 0.168660 | 0.113987 | 0.402895 | 0.056452 | 0.135347 | 0.490402 | -0.721754 | ... | 0.008902 | 0.212868 | -0.019408 | 0.135464 | 0.021068 | 0.144807 | 0.268460 | 0.513748 | -0.525760 | -0.336783 |
| mean_sea_level_pressure_MSL | -0.402240 | 0.100529 | 1.000000 | -0.159098 | -0.053871 | -0.151995 | -0.014646 | -0.129812 | -0.162043 | -0.188387 | ... | -0.119867 | -0.131442 | -0.161020 | -0.145696 | -0.125234 | -0.189266 | -0.075619 | 0.268111 | -0.137872 | 0.150551 |
| total_precipitation_sfc | -0.083137 | 0.168660 | -0.159098 | 1.000000 | 0.184497 | 0.223678 | 0.076255 | 0.262367 | 0.282748 | -0.130358 | ... | 0.005234 | 0.052376 | 0.007131 | 0.044797 | 0.003216 | 0.066701 | -0.020965 | -0.023408 | 0.005749 | -0.118442 |
| snowfall_amount_sfc | -0.121422 | 0.113987 | -0.053871 | 0.184497 | 1.000000 | 0.112646 | -0.026356 | 0.042867 | 0.151609 | -0.073499 | ... | 0.039734 | 0.093156 | 0.041246 | 0.100405 | 0.041716 | 0.093060 | -0.012497 | 0.033554 | 0.008426 | -0.049508 |
| total_cloud_cover_sfc | -0.326641 | 0.402895 | -0.151995 | 0.223678 | 0.112646 | 1.000000 | 0.442865 | 0.712077 | 0.746225 | -0.345089 | ... | 0.055057 | 0.183732 | 0.039671 | 0.174510 | 0.057816 | 0.212142 | -0.003426 | 0.136249 | -0.037427 | -0.334338 |
| high_cloud_cover_high_cld_lay | -0.019522 | 0.056452 | -0.014646 | 0.076255 | -0.026356 | 0.442865 | 1.000000 | 0.593300 | 0.024703 | -0.089620 | ... | 0.017688 | 0.090049 | 0.018228 | 0.078204 | 0.020897 | 0.092842 | -0.033840 | 0.031766 | 0.020790 | -0.147723 |
| medium_cloud_cover_mid_cld_lay | -0.100980 | 0.135347 | -0.129812 | 0.262367 | 0.042867 | 0.712077 | 0.593300 | 1.000000 | 0.236716 | -0.199843 | ... | 0.016954 | 0.088972 | 0.021935 | 0.076192 | 0.017195 | 0.079627 | -0.035511 | 0.046719 | 0.014802 | -0.227834 |
| low_cloud_cover_low_cld_lay | -0.381876 | 0.490402 | -0.162043 | 0.282748 | 0.151609 | 0.746225 | 0.024703 | 0.236716 | 1.000000 | -0.336751 | ... | 0.040060 | 0.156204 | 0.021782 | 0.153578 | 0.039875 | 0.193846 | 0.013421 | 0.120854 | -0.054328 | -0.288066 |
| shortwave_radiation_backwards_sfc | 0.665755 | -0.721754 | -0.188387 | -0.130358 | -0.073499 | -0.345089 | -0.089620 | -0.199843 | -0.336751 | 1.000000 | ... | -0.076530 | -0.077090 | -0.051670 | 0.028929 | -0.081545 | 0.017212 | -0.576921 | -0.801892 | 0.549296 | 0.556148 |
| wind_speed_10_m_above_gnd | -0.172532 | 0.109674 | -0.170199 | 0.044384 | 0.103749 | 0.175869 | 0.069620 | 0.069307 | 0.161919 | 0.078791 | ... | -0.035788 | 0.957745 | -0.005156 | 0.992851 | -0.017289 | 0.898893 | -0.173060 | -0.041168 | 0.194680 | -0.083043 |
| wind_direction_10_m_above_gnd | 0.051393 | 0.008902 | -0.119867 | 0.005234 | 0.039734 | 0.055057 | 0.017688 | 0.016954 | 0.040060 | -0.076530 | ... | 1.000000 | -0.023300 | 0.891487 | -0.046880 | 0.930226 | 0.059981 | 0.054676 | 0.044775 | 0.009908 | -0.073257 |
| wind_speed_80_m_above_gnd | -0.244869 | 0.212868 | -0.131442 | 0.052376 | 0.093156 | 0.183732 | 0.090049 | 0.088972 | 0.156204 | -0.077090 | ... | -0.023300 | 1.000000 | 0.005862 | 0.969352 | -0.003115 | 0.898347 | -0.049618 | 0.091319 | 0.064278 | -0.157899 |
| wind_direction_80_m_above_gnd | 0.086630 | -0.019408 | -0.161020 | 0.007131 | 0.041246 | 0.039671 | 0.018228 | 0.021935 | 0.021782 | -0.051670 | ... | 0.891487 | 0.005862 | 1.000000 | -0.014577 | 0.919390 | 0.065285 | 0.051170 | 0.029259 | 0.017849 | -0.069941 |
| wind_speed_900_mb | -0.198107 | 0.135464 | -0.145696 | 0.044797 | 0.100405 | 0.174510 | 0.078204 | 0.076192 | 0.153578 | 0.028929 | ... | -0.046880 | 0.969352 | -0.014577 | 1.000000 | -0.026721 | 0.894006 | -0.136442 | 0.004675 | 0.155932 | -0.107615 |
| wind_direction_900_mb | 0.043233 | 0.021068 | -0.125234 | 0.003216 | 0.041716 | 0.057816 | 0.020897 | 0.017195 | 0.039875 | -0.081545 | ... | 0.930226 | -0.003115 | 0.919390 | -0.026721 | 1.000000 | 0.071530 | 0.056517 | 0.048158 | -0.000427 | -0.077435 |
| wind_gust_10_m_above_gnd | -0.188264 | 0.144807 | -0.189266 | 0.066701 | 0.093060 | 0.212142 | 0.092842 | 0.079627 | 0.193846 | 0.017212 | ... | 0.059981 | 0.898347 | 0.065285 | 0.894006 | 0.071530 | 1.000000 | -0.122335 | -0.006612 | 0.152166 | -0.122808 |
| angle_of_incidence | -0.090173 | 0.268460 | -0.075619 | -0.020965 | -0.012497 | -0.003426 | -0.033840 | -0.035511 | 0.013421 | -0.576921 | ... | 0.054676 | -0.049618 | 0.051170 | -0.136442 | 0.056517 | -0.122335 | 1.000000 | 0.712773 | -0.288647 | -0.646537 |
| zenith | -0.545646 | 0.513748 | 0.268111 | -0.023408 | 0.033554 | 0.136249 | 0.031766 | 0.046719 | 0.120854 | -0.801892 | ... | 0.044775 | 0.091319 | 0.029259 | 0.004675 | 0.048158 | -0.006612 | 0.712773 | 1.000000 | -0.247447 | -0.649991 |
| azimuth | 0.381797 | -0.525760 | -0.137872 | 0.005749 | 0.008426 | -0.037427 | 0.020790 | 0.014802 | -0.054328 | 0.549296 | ... | 0.009908 | 0.064278 | 0.017849 | 0.155932 | -0.000427 | 0.152166 | -0.288647 | -0.247447 | 1.000000 | -0.061184 |
| generated_power_kw | 0.217280 | -0.336783 | 0.150551 | -0.118442 | -0.049508 | -0.334338 | -0.147723 | -0.227834 | -0.288066 | 0.556148 | ... | -0.073257 | -0.157899 | -0.069941 | -0.107615 | -0.077435 | -0.122808 | -0.646537 | -0.649991 | -0.061184 | 1.000000 |
21 rows × 21 columns
# Correlation matrix heatmap
plt.figure(figsize=(12, 12))
sns.heatmap(correlation_matrix, annot=True)
plt.show()
Identify the top correlations
Positive Correlations:
# Flatten the correlation matrix into (variable, variable) pairs
correlation_pairs = correlation_matrix.unstack()
positive_sorted_pairs = correlation_pairs.sort_values(kind="quicksort", ascending=False)
# Exclude self correlations (correlation of a variable with itself will always be 1)
positive_no_self_correlation = positive_sorted_pairs[positive_sorted_pairs != 1]
# Display top positively correlated pairs
positive_no_self_correlation.head(10)
wind_speed_10_m_above_gnd wind_speed_900_mb 0.992851
wind_speed_900_mb wind_speed_10_m_above_gnd 0.992851
wind_speed_80_m_above_gnd 0.969352
wind_speed_80_m_above_gnd wind_speed_900_mb 0.969352
wind_speed_10_m_above_gnd 0.957745
wind_speed_10_m_above_gnd wind_speed_80_m_above_gnd 0.957745
wind_direction_900_mb wind_direction_10_m_above_gnd 0.930226
wind_direction_10_m_above_gnd wind_direction_900_mb 0.930226
wind_direction_900_mb wind_direction_80_m_above_gnd 0.919390
wind_direction_80_m_above_gnd wind_direction_900_mb 0.919390
dtype: float64
Negative Correlations:
negative_sorted_pairs = correlation_pairs.sort_values(kind="quicksort", ascending=True)
# Exclude self correlations (correlation of a variable with itself will always be 1)
negative_no_self_correlation = negative_sorted_pairs[negative_sorted_pairs != 1]
# Display top negatively correlated pairs
negative_no_self_correlation.head(10)
zenith shortwave_radiation_backwards_sfc -0.801892
shortwave_radiation_backwards_sfc zenith -0.801892
temperature_2_m_above_gnd relative_humidity_2_m_above_gnd -0.771704
relative_humidity_2_m_above_gnd temperature_2_m_above_gnd -0.771704
shortwave_radiation_backwards_sfc -0.721754
shortwave_radiation_backwards_sfc relative_humidity_2_m_above_gnd -0.721754
zenith generated_power_kw -0.649991
generated_power_kw zenith -0.649991
angle_of_incidence generated_power_kw -0.646537
generated_power_kw angle_of_incidence -0.646537
dtype: float64
Top correlations
# combine negative and positive correlations
top_correlated_pairs = pd.concat(
[positive_no_self_correlation.head(10), negative_no_self_correlation.head(11)]
)
top_correlated_pairs
wind_speed_10_m_above_gnd wind_speed_900_mb 0.992851
wind_speed_900_mb wind_speed_10_m_above_gnd 0.992851
wind_speed_80_m_above_gnd 0.969352
wind_speed_80_m_above_gnd wind_speed_900_mb 0.969352
wind_speed_10_m_above_gnd 0.957745
wind_speed_10_m_above_gnd wind_speed_80_m_above_gnd 0.957745
wind_direction_900_mb wind_direction_10_m_above_gnd 0.930226
wind_direction_10_m_above_gnd wind_direction_900_mb 0.930226
wind_direction_900_mb wind_direction_80_m_above_gnd 0.919390
wind_direction_80_m_above_gnd wind_direction_900_mb 0.919390
zenith shortwave_radiation_backwards_sfc -0.801892
shortwave_radiation_backwards_sfc zenith -0.801892
temperature_2_m_above_gnd relative_humidity_2_m_above_gnd -0.771704
relative_humidity_2_m_above_gnd temperature_2_m_above_gnd -0.771704
shortwave_radiation_backwards_sfc -0.721754
shortwave_radiation_backwards_sfc relative_humidity_2_m_above_gnd -0.721754
zenith generated_power_kw -0.649991
generated_power_kw zenith -0.649991
angle_of_incidence generated_power_kw -0.646537
generated_power_kw angle_of_incidence -0.646537
shortwave_radiation_backwards_sfc angle_of_incidence -0.576921
dtype: float64
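Because `unstack()` keeps both orientations of every pair, each correlation appears twice in the listing above. One way to keep each unordered pair only once is to mask the diagonal and upper triangle before flattening. This is a sketch using a hypothetical `top_pairs` helper on synthetic data, not part of the notebook's pipeline:

```python
import numpy as np
import pandas as pd

def top_pairs(corr: pd.DataFrame, n: int = 10, ascending: bool = False) -> pd.Series:
    # Mask the diagonal and upper triangle so each pair survives only once
    mask = np.triu(np.ones(corr.shape, dtype=bool))
    pairs = corr.mask(mask).unstack().dropna()
    return pairs.sort_values(ascending=ascending).head(n)

# Synthetic stand-in data for demonstration
demo = pd.DataFrame(np.random.default_rng(43).normal(size=(100, 4)),
                    columns=["w", "x", "y", "z"])
strongest = top_pairs(demo.corr(), n=3)
```

Applied to `correlation_matrix`, this would halve the listing without losing any pair.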
Visualize the top correlations
# Extracting the column names for each pair
pairs = [(index[0], index[1]) for index in top_correlated_pairs.index]
# Plotting each pair
# Determine the number of rows and columns for the subplot grid
n_rows = 7
n_cols = 3
# Create a grid of subplots
fig, axes = plt.subplots(n_rows, n_cols, figsize=(10, 20))
# Flatten the axes array for easy iteration
axes = axes.flatten()
# Plotting each pair in the grid
for i, (x, y) in enumerate(pairs):
sns.scatterplot(x=x, y=y, data=data, ax=axes[i], alpha=0.5)
# Adjust layout for better spacing
plt.tight_layout()
plt.show()
Visualize correlations of each variable with the target variable
target_variable = "generated_power_kw"
# Extracting all feature names except the target variable
feature_names = [col for col in data.columns if col != target_variable]
# Determine the number of rows and columns for the subplot grid
n_rows = 7
n_cols = 3
# Create a grid of subplots
fig, axes = plt.subplots(n_rows, n_cols, figsize=(10, 20))
# Flatten the axes array for easy iteration
axes = axes.flatten()
# Plotting each feature against the target variable
for i, feature in enumerate(feature_names):
sns.scatterplot(x=feature, y=target_variable, data=data, ax=axes[i], alpha=0.5)
# Adjust layout for better spacing
plt.tight_layout()
plt.show()
General Insights:¶
- **Temperature and Power Generation:** The correlation between `temperature_2_m_above_gnd` and `generated_power_kw` is positive (around 0.22), suggesting that higher temperatures might be associated with increased solar power generation. This could be due to more intense sunlight and favorable conditions for solar panels.
- **Humidity and Cloud Cover:** There is a negative correlation between `relative_humidity_2_m_above_gnd`, `total_cloud_cover_sfc`, and `generated_power_kw` (around -0.34 for both), indicating that higher humidity and more cloud cover are likely associated with lower solar power output. This is expected, as clouds and humidity can reduce the solar radiation reaching the panels.
- **Shortwave Radiation:** A strong positive correlation (around 0.56) is observed between `shortwave_radiation_backwards_sfc` and `generated_power_kw`. This is quite intuitive, as more solar radiation directly translates to higher potential for solar power generation.
- **Wind Speed:** The correlations between the different wind speed measurements (`wind_speed_10_m_above_gnd`, `wind_speed_80_m_above_gnd`, `wind_speed_900_mb`) and `generated_power_kw` are slightly negative, suggesting that higher wind speeds do not directly contribute to increased solar power generation. This may simply reflect that wind speed does not directly affect solar radiation.
- **Angle of Incidence and Zenith:** Both `angle_of_incidence` and `zenith` have strong negative correlations with `generated_power_kw` (around -0.65 for both). This indicates that the sun's position (angle and zenith) plays a significant role in power generation, likely due to the varying intensity of solar radiation throughout the day.
The correlation matrix reveals meaningful relationships between various meteorological factors and solar power generation. Key factors like temperature, shortwave radiation, cloud cover, and sun positioning (angle of incidence and zenith) show significant correlations with power generation, aligning well with the fundamental principles of solar energy. Understanding these relationships can guide more detailed analyses, especially for predicting solar power generation based on weather conditions.
Feature Selection and Reduction¶
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
# Separating features and target variable
X = data.drop(target_variable, axis=1)
y = data[target_variable]
Standardizing the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Using PCA to determine the appropriate number of principal components to keep
pca = PCA(random_state=RANDOM_SEED)
X_pca = pca.fit_transform(X_scaled)
Get the Cumulative Proportion of Variance (CPV) explained by each component
cpv = np.cumsum(pca.explained_variance_ratio_)
cpv
array([0.22490081, 0.41335927, 0.55830037, 0.66921138, 0.73559625,
0.79702998, 0.8447486 , 0.88669961, 0.92465582, 0.94380398,
0.96143295, 0.96998183, 0.97707577, 0.98311315, 0.98810758,
0.9920493 , 0.99529021, 0.99830264, 0.9997545 , 1. ])
# Plotting the CPV
plt.figure(figsize=(10, 6))
plt.plot(cpv, marker="o", linestyle="--")
plt.xlabel("Number of Principal Components")
plt.ylabel("Cumulative Proportion of Variance Explained")
plt.title("Cumulative Proportion of Variance (CPV) Explained by PCA Components")
plt.grid()
plt.show()
Based on the CPV, the first 11 components explain at least 95% of the variance in the data (about 96.1%). This means we can reduce the number of features from 20 to 11 without losing much information.
# determine the number of components that explain at least 95% of the variance
n_components = np.where(cpv >= 0.95)[0][0] + 1
n_components
11
Using PCA to reduce the number of features to 11
pca_reduced = PCA(n_components=n_components, random_state=RANDOM_SEED)
X_pca_reduced = pca_reduced.fit_transform(X_scaled)
X_pca_reduced.shape
(4213, 11)
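Incidentally, scikit-learn can perform this threshold-based selection in a single step: passing a float between 0 and 1 as `n_components` keeps the smallest number of components whose cumulative explained variance reaches that fraction. A sketch on synthetic stand-in data (not the actual features):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(43)
X_demo = StandardScaler().fit_transform(rng.normal(size=(300, 20)))

# Keep just enough components to explain at least 95% of the variance
pca_95 = PCA(n_components=0.95, random_state=43)
X_kept = pca_95.fit_transform(X_demo)
cpv_last = np.cumsum(pca_95.explained_variance_ratio_)[-1]
```

On the real `X_scaled`, this yields the same 11 components selected manually above.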
Model Selection and Evaluation¶
Using holdout validation, the dataset is divided into training and testing sets with an 80-20 split: 80% of the data is used for training the models, and 20% is held out for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X_pca_reduced, y, test_size=0.2, random_state=RANDOM_SEED
)
X_train.shape, X_test.shape
((3370, 11), (843, 11))
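One caveat: the scaler and PCA above were fit on the full dataset before splitting, so test-set statistics leak into the transforms. A common remedy is to wrap the steps in a `Pipeline` that is fit only on the training portion. A hedged sketch of the wiring on synthetic data (illustrating the pattern, not reproducing the notebook's actual models):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(43)
X_demo = rng.normal(size=(200, 20))
y_demo = X_demo @ rng.normal(size=20) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=43
)

pipe = Pipeline([
    ("scale", StandardScaler()),                      # fit on training data only
    ("pca", PCA(n_components=0.95, random_state=43)),  # variance-threshold PCA
    ("model", LinearRegression()),
])
pipe.fit(X_tr, y_tr)
r2_test = pipe.score(X_te, y_te)  # R² on the held-out 20%
```

For a fixed transform as used here the leakage is likely small, but the pipeline form also composes cleanly with the cross-validation and grid search used later.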
Model selection¶
Different models are considered for this regression task:
- Linear Regression
- Ridge Regression
- Lasso Regression
- Elastic Net Regression
- Random Forest Regression
- Support Vector Regression
- Gradient Boosting Regression
- K-Nearest Neighbors Regression
- Decision Tree Regression
- Neural Network Regression
- Stochastic Gradient Descent Regression
- AdaBoost Regression
from sklearn.linear_model import (
LinearRegression,
Ridge,
ElasticNet,
Lasso,
SGDRegressor,
)
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import (
mean_absolute_error,
mean_squared_error,
r2_score,
mean_absolute_percentage_error,
)
Define the models with default parameters
models = {
"Linear Regression": LinearRegression(),
"Ridge": Ridge(random_state=RANDOM_SEED),
"Lasso": Lasso(random_state=RANDOM_SEED),
"Elastic Net": ElasticNet(random_state=RANDOM_SEED),
"Random Forest": RandomForestRegressor(random_state=RANDOM_SEED),
"Support Vector Regressor": SVR(),
"Gradient Boosting Regressor": GradientBoostingRegressor(random_state=RANDOM_SEED),
"K-Nearest Neighbors": KNeighborsRegressor(),
"Decision Tree": DecisionTreeRegressor(),
"Neural Network": MLPRegressor(random_state=RANDOM_SEED),
"Stochastic Gradient Descent": SGDRegressor(random_state=RANDOM_SEED),
"Adaboost Regressor": AdaBoostRegressor(random_state=RANDOM_SEED),
}
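As a complement to the single holdout evaluation, each candidate could also be scored with k-fold cross-validation, averaging performance over several splits as mentioned in the project objectives. A minimal sketch with synthetic data standing in for the PCA-reduced features:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(43)
X_demo = rng.normal(size=(200, 11))
y_demo = X_demo @ rng.normal(size=11) + rng.normal(scale=0.5, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=43)
scores = cross_val_score(Ridge(random_state=43), X_demo, y_demo,
                         cv=cv, scoring="r2")
mean_r2 = scores.mean()  # average R² across the five folds
```

The same `cross_val_score` call works for any model in the `models` dict above; the holdout comparison below is kept for speed.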
Define the regression metrics to evaluate the models
def regression_metrics(y_true, y_pred):
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)
return {
"Mean Absolute Error": mae,
"Mean Squared Error": mse,
"Root Mean Squared Error": rmse,
"R2 Score": r2,
"Mean Absolute Percentage Error": mape,
"Mean Bias Error": np.mean(y_pred - y_true),
}
Train the models and evaluate performance on the test set
regression_results = {}
for name, model in models.items():
print(f"Training {name}...")
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
regression_results[name] = regression_metrics(y_test, y_pred)
regression_results_df = pd.DataFrame(regression_results).T
regression_results_df
Training Linear Regression... Training Ridge... Training Lasso... Training Elastic Net... Training Random Forest... Training Support Vector Regressor... Training Gradient Boosting Regressor... Training K-Nearest Neighbors... Training Decision Tree... Training Neural Network...
/opt/homebrew/lib/python3.11/site-packages/sklearn/neural_network/_multilayer_perceptron.py:691: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet. warnings.warn(
Training Stochastic Gradient Descent... Training Adaboost Regressor...
| Mean Absolute Error | Mean Squared Error | Root Mean Squared Error | R2 Score | Mean Absolute Percentage Error | Mean Bias Error | |
|---|---|---|---|---|---|---|
| Linear Regression | 399.579896 | 267512.251601 | 517.215866 | 0.706040 | 482.908191 | -28.937825 |
| Ridge | 399.597218 | 267515.297004 | 517.218810 | 0.706036 | 482.901372 | -28.945068 |
| Lasso | 400.000643 | 267604.829700 | 517.305354 | 0.705938 | 482.911397 | -29.218607 |
| Elastic Net | 457.783607 | 310797.582356 | 557.492226 | 0.658475 | 636.834457 | -36.073312 |
| Random Forest | 325.025954 | 209870.702463 | 458.116473 | 0.769380 | 185.637110 | -15.090408 |
| Support Vector Regressor | 765.358677 | 770179.800557 | 877.598884 | 0.153675 | 1148.574269 | -218.317426 |
| Gradient Boosting Regressor | 358.759991 | 242029.866308 | 491.965310 | 0.734042 | 267.367222 | -25.521922 |
| K-Nearest Neighbors | 328.165821 | 225101.558087 | 474.448689 | 0.752643 | 455.182658 | 0.885740 |
| Decision Tree | 422.534791 | 407765.408259 | 638.565117 | 0.551920 | 471.186008 | -12.077871 |
| Neural Network | 369.533026 | 262163.305875 | 512.018853 | 0.711918 | 211.139910 | -68.978097 |
| Stochastic Gradient Descent | 399.704186 | 266651.370733 | 516.382969 | 0.706986 | 504.922495 | -31.717054 |
| Adaboost Regressor | 583.034156 | 426736.188421 | 653.250479 | 0.531074 | 749.475476 | 31.755897 |
Selecting the best model¶
Select the best model based on their performance on the test set
# sort the results by R2 score in descending order
regression_results_df.sort_values(by="R2 Score", ascending=False)
| Mean Absolute Error | Mean Squared Error | Root Mean Squared Error | R2 Score | Mean Absolute Percentage Error | Mean Bias Error | |
|---|---|---|---|---|---|---|
| Random Forest | 325.025954 | 209870.702463 | 458.116473 | 0.769380 | 185.637110 | -15.090408 |
| K-Nearest Neighbors | 328.165821 | 225101.558087 | 474.448689 | 0.752643 | 455.182658 | 0.885740 |
| Gradient Boosting Regressor | 358.759991 | 242029.866308 | 491.965310 | 0.734042 | 267.367222 | -25.521922 |
| Neural Network | 369.533026 | 262163.305875 | 512.018853 | 0.711918 | 211.139910 | -68.978097 |
| Stochastic Gradient Descent | 399.704186 | 266651.370733 | 516.382969 | 0.706986 | 504.922495 | -31.717054 |
| Linear Regression | 399.579896 | 267512.251601 | 517.215866 | 0.706040 | 482.908191 | -28.937825 |
| Ridge | 399.597218 | 267515.297004 | 517.218810 | 0.706036 | 482.901372 | -28.945068 |
| Lasso | 400.000643 | 267604.829700 | 517.305354 | 0.705938 | 482.911397 | -29.218607 |
| Elastic Net | 457.783607 | 310797.582356 | 557.492226 | 0.658475 | 636.834457 | -36.073312 |
| Decision Tree | 422.534791 | 407765.408259 | 638.565117 | 0.551920 | 471.186008 | -12.077871 |
| Adaboost Regressor | 583.034156 | 426736.188421 | 653.250479 | 0.531074 | 749.475476 | 31.755897 |
| Support Vector Regressor | 765.358677 | 770179.800557 | 877.598884 | 0.153675 | 1148.574269 | -218.317426 |
Using the default parameters, the best model is the Random Forest Regressor, with an RMSE of 458 and an R2 score of 0.769. The worst model is the Support Vector Regressor, with an RMSE of 878 and an R2 score of 0.154.
We will proceed with the Random Forest Regression model and tune its hyperparameters to improve its performance.
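The same selection can be done programmatically rather than by eye. A minimal sketch, using a toy frame with illustrative values copied from the table above (the real notebook would use `regression_results_df` directly):

```python
import pandas as pd

# Toy stand-in for regression_results_df, with a few rows from the table above
results = pd.DataFrame(
    {
        "R2 Score": [0.769380, 0.752643, 0.153675],
        "Root Mean Squared Error": [458.116473, 474.448689, 877.598884],
    },
    index=["Random Forest", "K-Nearest Neighbors", "Support Vector Regressor"],
)

# Index label of the row with the highest R2 score
best_model_name = results["R2 Score"].idxmax()
print(best_model_name)  # Random Forest
```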
Hyperparameter Tuning for the Random Forest Regression Model¶
Using Grid Search to find the best hyperparameters
from sklearn.model_selection import GridSearchCV
Define the hyperparameters to tune
param_grid = {
"n_estimators": [200, 300], # number of trees
"max_depth": [10, None], # maximum depth of each tree
"min_samples_split": [2, 5], # minimum number of samples required to split a node
"min_samples_leaf": [1, 2], # minimum number of samples required at each leaf node
"max_features": ["sqrt", "log2"], # number of features to consider at each split
"bootstrap": [True, False], # method of selecting samples for training each tree
"random_state": [RANDOM_SEED], # random seed
}
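As a sanity check on the search cost: the grid above has six binary choices and one fixed seed, so it yields 2⁶ = 64 candidates, and 3-fold cross-validation makes 64 × 3 = 192 fits. A quick pure-Python verification (the `random_state` value 43 mirrors the seed visible in the log output):

```python
from math import prod

# Same grid as above; random_state is a single fixed value (factor of 1)
param_grid = {
    "n_estimators": [200, 300],
    "max_depth": [10, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
    "random_state": [43],
}

# Number of candidates = product of the number of options per hyperparameter
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * 3  # one fit per fold in 3-fold CV
print(n_candidates, n_fits)  # 64 192
```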
rf = RandomForestRegressor()
Using 3-Fold Cross Validation to evaluate the model with different hyperparameter values
# Instantiate the grid search model
grid_search = GridSearchCV(
estimator=rf, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2
)
# Fit the grid search to the data
scores = grid_search.fit(X_train, y_train)
Fitting 3 folds for each of 64 candidates, totalling 192 fits
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200, random_state=43; total time= 1.1s
... (remaining per-fit [CV] END log lines omitted; each of the 192 fits took roughly 1–3 s)
The best hyperparameters for the Random Forest Regression model are:
grid_search.best_params_
{'bootstrap': False,
'max_depth': None,
'max_features': 'sqrt',
'min_samples_leaf': 1,
'min_samples_split': 2,
'n_estimators': 300,
'random_state': 43}
# Using the best estimator that is trained using 3-fold cross validation
regression_metrics(y_test, grid_search.best_estimator_.predict(X_test))
{'Mean Absolute Error': 320.4081448458532,
'Mean Squared Error': 200021.18391009144,
'Root Mean Squared Error': 447.2372792043296,
'R2 Score': 0.7802034371098439,
'Mean Absolute Percentage Error': 337.93869897545613,
'Mean Bias Error': -13.199074736835223}
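Relative to the default Random Forest (RMSE ≈ 458.12, R² ≈ 0.769), tuning lowers the RMSE by roughly 2.4%. A quick check of that figure, using the values reported in the two result sets above:

```python
# RMSE of the default vs. tuned Random Forest, copied from the results above
rmse_default = 458.116473
rmse_tuned = 447.237279

improvement_pct = (rmse_default - rmse_tuned) / rmse_default * 100
print(f"{improvement_pct:.2f}% lower RMSE after tuning")  # 2.37% lower RMSE after tuning
```

A modest but real gain; most of the model's performance here comes from the default configuration, with tuning providing incremental refinement.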
from sklearn.pipeline import make_pipeline
# A function to train a model and evaluate it
def model_it(X, y, model, dim_reduce=None, random_state=RANDOM_SEED, test_size=0.2):
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=test_size, random_state=random_state
)
steps = [StandardScaler()]
if dim_reduce:
steps.append(dim_reduce)
steps.append(model)
pipeline = make_pipeline(*steps)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
return pipeline, regression_metrics(y_test, y_pred), (y_test, y_pred)
# Using the best estimator parameters, train the model and evaluate it
(
pipeline,
metrics,
_,
) = model_it(
X,
y,
RandomForestRegressor(**grid_search.best_params_),
dim_reduce=PCA(n_components=11, random_state=RANDOM_SEED),
random_state=RANDOM_SEED,
)
pipeline, metrics
(Pipeline(steps=[('standardscaler', StandardScaler()),
('pca', PCA(n_components=11, random_state=43)),
('randomforestregressor',
RandomForestRegressor(bootstrap=False, max_features='sqrt',
n_estimators=300, random_state=43))]),
{'Mean Absolute Error': 320.06003789818317,
'Mean Squared Error': 198856.6440063794,
'Root Mean Squared Error': 445.9334524414819,
'R2 Score': 0.781483110908292,
'Mean Absolute Percentage Error': 336.55505317985006,
'Mean Bias Error': -10.603309197779827})
Trying out models without feature reduction¶
no_dim_reduce_regression_results = {}
for name, model in models.items():
print(f"Modeling {name}...")
pipeline, metrics, _ = model_it(
X, y, model, dim_reduce=None, random_state=RANDOM_SEED
)
no_dim_reduce_regression_results[name] = metrics
no_dim_reduce_regression_results_df = pd.DataFrame(no_dim_reduce_regression_results).T
no_dim_reduce_regression_results_df.sort_values(by="R2 Score", ascending=False)
Modeling Linear Regression... Modeling Ridge... Modeling Lasso... Modeling Elastic Net... Modeling Random Forest... Modeling Support Vector Regressor... Modeling Gradient Boosting Regressor... Modeling K-Nearest Neighbors... Modeling Decision Tree... Modeling Neural Network...
/opt/homebrew/lib/python3.11/site-packages/sklearn/neural_network/_multilayer_perceptron.py:691: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet. warnings.warn(
Modeling Stochastic Gradient Descent... Modeling Adaboost Regressor...
| Mean Absolute Error | Mean Squared Error | Root Mean Squared Error | R2 Score | Mean Absolute Percentage Error | Mean Bias Error | |
|---|---|---|---|---|---|---|
| Random Forest | 256.803865 | 167480.645723 | 409.243993 | 0.815961 | 374.045361 | -10.243749 |
| Gradient Boosting Regressor | 288.253586 | 181831.952450 | 426.417580 | 0.800191 | 442.914880 | -10.369109 |
| K-Nearest Neighbors | 321.737687 | 222159.164973 | 471.337634 | 0.755877 | 459.336895 | 10.986202 |
| Stochastic Gradient Descent | 390.417655 | 258539.957997 | 508.468247 | 0.715899 | 392.405691 | -29.985635 |
| Lasso | 391.941881 | 259922.668352 | 509.826116 | 0.714380 | 373.129578 | -28.181217 |
| Ridge | 391.667577 | 260409.720046 | 510.303557 | 0.713845 | 378.620236 | -28.480979 |
| Linear Regression | 391.696464 | 260518.340859 | 510.409973 | 0.713725 | 379.165057 | -28.510247 |
| Neural Network | 375.809152 | 264079.350993 | 513.886516 | 0.709812 | 217.355974 | -71.569474 |
| Adaboost Regressor | 450.733503 | 297565.555569 | 545.495697 | 0.673015 | 589.670260 | -43.050280 |
| Elastic Net | 455.562872 | 307948.782428 | 554.931331 | 0.661605 | 604.902121 | -36.053680 |
| Decision Tree | 324.741653 | 324060.063400 | 569.262737 | 0.643901 | 419.247085 | -2.952272 |
| Support Vector Regressor | 769.850677 | 779939.486008 | 883.141827 | 0.142951 | 1146.575229 | -221.312917 |
Trying Random Forest Regression without feature reduction, using the best hyperparameters from the grid search
no_dim_reduce_rf_pipeline, no_dim_reduce_rf_metrics, no_dim_reduce_rf_ys = model_it(
X,
y,
RandomForestRegressor(**grid_search.best_params_),
dim_reduce=None,
random_state=RANDOM_SEED,
)
no_dim_reduce_rf_pipeline
Pipeline(steps=[('standardscaler', StandardScaler()),
('randomforestregressor',
RandomForestRegressor(bootstrap=False, max_features='sqrt',
n_estimators=300, random_state=43))])
no_dim_reduce_rf_model = no_dim_reduce_rf_pipeline.named_steps["randomforestregressor"]
no_dim_reduce_rf_model
RandomForestRegressor(bootstrap=False, max_features='sqrt', n_estimators=300,
random_state=43)
no_dim_reduce_rf_metrics
{'Mean Absolute Error': 256.551089927555,
'Mean Squared Error': 155789.53863814945,
'Root Mean Squared Error': 394.7018351086671,
'R2 Score': 0.8288081069338133,
'Mean Absolute Percentage Error': 398.1254764493029,
'Mean Bias Error': -14.762168127768302}
It looks like the model performs slightly better without feature reduction, with an RMSE of about 395 and an R² score of 0.829. This is expected, as the Random Forest model handles multicollinearity and feature interactions well, leaving little for PCA to add.
Trying out Ensemble learning using stacking¶
from sklearn.ensemble import StackingRegressor
# define the base models
base_regressors = [
('svr', SVR()),
('ada', AdaBoostRegressor(random_state=RANDOM_SEED)),
('knn', KNeighborsRegressor()),
('rf', RandomForestRegressor(**grid_search.best_params_)),
]
# define the stacking ensemble
stacked_regressor = StackingRegressor(
estimators=base_regressors,
final_estimator=LinearRegression(),
verbose=2,
)
stacked_pipeline, stacked_metrics, stacked_ys = model_it(
X,
y,
stacked_regressor,
dim_reduce=None,
random_state=RANDOM_SEED,
)
stacked_pipeline, stacked_metrics
(Pipeline(steps=[('standardscaler', StandardScaler()),
('stackingregressor',
StackingRegressor(estimators=[('svr', SVR()),
('ada',
AdaBoostRegressor(random_state=43)),
('knn', KNeighborsRegressor()),
('rf',
RandomForestRegressor(bootstrap=False,
max_features='sqrt',
n_estimators=300,
random_state=43))],
final_estimator=LinearRegression(),
verbose=2))]),
{'Mean Absolute Error': 247.22433880842823,
'Mean Squared Error': 149659.85233279874,
'Root Mean Squared Error': 386.85895664027055,
'R2 Score': 0.8355438133983674,
'Mean Absolute Percentage Error': 365.5321772342633,
'Mean Bias Error': -13.285098460772305})
The Final Model¶
final_model = stacked_pipeline
final_model
Pipeline(steps=[('standardscaler', StandardScaler()),
('stackingregressor',
StackingRegressor(estimators=[('svr', SVR()),
('ada',
AdaBoostRegressor(random_state=43)),
('knn', KNeighborsRegressor()),
('rf',
RandomForestRegressor(bootstrap=False,
max_features='sqrt',
n_estimators=300,
random_state=43))],
final_estimator=LinearRegression(),
verbose=2))])
y_test, y_pred = stacked_ys
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.title("Actual vs Predicted Values")
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.plot(
[y_test.min(), y_test.max()], [y_test.min(), y_test.max()], "k--", lw=2
) # Diagonal line
plt.show()
The final model is a stacked regression ensemble, an ensemble learning technique that combines multiple base regressors with a final estimator.
The base regressors include a Support Vector Regressor (SVR), an AdaBoostRegressor with a random state for reproducibility, a K-Nearest Neighbors Regressor (KNN), and a RandomForestRegressor whose parameters are optimized using grid search. These diverse models are trained independently to capture different patterns in the data. The predictions from these base models are then used as input to a final estimator, which is a Linear Regression model. This final model learns how to optimally combine the predictions of the base models to produce a more accurate and robust final prediction.
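Internally, `StackingRegressor` fits each base model and feeds its out-of-fold cross-validated predictions, as columns of a new feature matrix, to the final estimator. The mechanics can be sketched by hand as below; the toy dataset, the reduced base-model set, and the hyperparameters are illustrative stand-ins, not the notebook's configuration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

# Toy stand-in for the notebook's X and y
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=43)

base_models = [
    KNeighborsRegressor(),
    RandomForestRegressor(n_estimators=50, random_state=43),
]

# Column i holds the out-of-fold predictions of base model i
# (StackingRegressor uses 5-fold CV by default)
meta_features = np.column_stack(
    [cross_val_predict(m, X_toy, y_toy, cv=5) for m in base_models]
)

# The final estimator learns how to weight the base models' predictions
meta_model = LinearRegression().fit(meta_features, y_toy)
print(meta_model.coef_)  # one learned weight per base model
```

Using out-of-fold rather than in-sample predictions keeps the meta-features honest: the final estimator never sees predictions a base model made on data it was trained on.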
The model underwent training with an 80-20 train-test split and utilized the following configuration:
- Model Configuration:
Stacking Regressor:
Base Regressors:
- Support Vector Regressor (SVR): A regression algorithm based on support vector machine theory.
- AdaBoostRegressor: An ensemble method using boosting, focusing on difficult cases in successive iterations, and initialized with a fixed random state for reproducibility.
- K-Nearest Neighbors Regressor (KNN): An instance-based learning method predicting responses by interpolating the targets of nearest neighbors.
- Random Forest Regressor: An ensemble of decision trees, optimized using parameters determined from a previous grid search.
- Parameters: bootstrap=False, max_features='sqrt', n_estimators=300, random_state=43
Final Estimator:
- Linear Regression
Data Preprocessing: Standard Scaling applied to features
Performance Metrics Analysis¶
Mean Absolute Error (MAE) - 247.22 kW:
- The model, on average, errs by about 247.22 kW. This metric indicates the average absolute deviation of the model predictions from the actual values.
Mean Squared Error (MSE) - 149659.85 kW²:
- The MSE, representing the average squared differences between predicted and actual values, is 149659.85 kW². This high value suggests that the model may have instances of significant errors in its predictions.
Root Mean Squared Error (RMSE) - 386.85 kW:
- The RMSE, which is the square root of MSE, suggests that typical prediction errors are about 386.85 kW. This value provides an understanding of the error magnitude in the same unit as the target variable (kW).
R2 Score - 0.8355:
- An R2 Score of 0.8355 indicates that approximately 83.55% of the variance in solar PV output is predictable by the model. This high R² value points to a strong fit to the data.
Mean Absolute Percentage Error (MAPE) - 365.53%:
- The high MAPE suggests that there are instances where the model's predictions are significantly off from the actual values. This could be more pronounced in cases with lower solar PV outputs.
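The sensitivity of MAPE to small actual values can be shown with made-up numbers: a modest absolute error on a near-zero output (such as a dawn or dusk reading) produces an enormous percentage error that dominates the average, even when MAE stays small.

```python
import numpy as np

# Hypothetical PV outputs in kW: one near-zero reading and two large ones
y_true = np.array([1.0, 500.0, 1000.0])
y_pred = np.array([6.0, 510.0, 990.0])  # small absolute errors throughout

mae = np.mean(np.abs(y_pred - y_true))                    # 8.33 kW
mape = np.mean(np.abs((y_pred - y_true) / y_true)) * 100  # 167.67%, driven by the 1 kW row
print(mae, mape)
```
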
Mean Bias Error (MBE) - -13.285 kW:
- The negative MBE indicates a slight tendency of the model to underestimate the solar PV output, with an average underestimation of around 13.285 kW.
The model demonstrates strong predictive capabilities, particularly highlighted by the R² Score. However, the significant MAE and MSE, and particularly the high MAPE, point to potential accuracy issues, especially in scenarios of lower solar PV output. The negative MBE suggests a consistent tendency to underestimate.
Feature Importance Analysis (based on RandomForestRegressor)¶
importances = no_dim_reduce_rf_model.feature_importances_
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(12, 6))
plt.title("Feature Importances in RandomForest Regressor")
plt.bar(range(X.shape[1]), importances[indices], align="center")
plt.xticks(range(X.shape[1]), [X.columns[i] for i in indices], rotation=90)
plt.xlabel("Feature")
plt.ylabel("Importance")
plt.show()
# display as table
pd.DataFrame(
{
"Feature": [X.columns[i] for i in indices],
"Importance": importances[indices],
}
)
| Feature | Importance | |
|---|---|---|
| 0 | zenith | 0.202861 |
| 1 | angle_of_incidence | 0.199005 |
| 2 | azimuth | 0.124835 |
| 3 | shortwave_radiation_backwards_sfc | 0.123308 |
| 4 | total_cloud_cover_sfc | 0.050858 |
| 5 | relative_humidity_2_m_above_gnd | 0.048323 |
| 6 | mean_sea_level_pressure_MSL | 0.034634 |
| 7 | low_cloud_cover_low_cld_lay | 0.030489 |
| 8 | temperature_2_m_above_gnd | 0.027970 |
| 9 | medium_cloud_cover_mid_cld_lay | 0.020056 |
| 10 | wind_gust_10_m_above_gnd | 0.019489 |
| 11 | wind_speed_80_m_above_gnd | 0.017278 |
| 12 | wind_speed_900_mb | 0.016905 |
| 13 | wind_speed_10_m_above_gnd | 0.016809 |
| 14 | wind_direction_10_m_above_gnd | 0.016469 |
| 15 | wind_direction_900_mb | 0.016216 |
| 16 | wind_direction_80_m_above_gnd | 0.015625 |
| 17 | high_cloud_cover_high_cld_lay | 0.011792 |
| 18 | total_precipitation_sfc | 0.006796 |
| 19 | snowfall_amount_sfc | 0.000281 |
The feature importance results from RandomForest Regressor provide valuable insights into which features are most influential in predicting the target variable in the dataset. Here's an analysis of the results:
Top Features¶
- Zenith (20.28%): The most important feature. The zenith angle of the sun seems to have the highest impact on the model's predictions, suggesting that the position of the sun in the sky is crucial in determining solar power generation.
- Angle of Incidence (19.9%): The second most important feature. It indicates the angle at which the sunlight strikes the solar panels, impacting their efficiency and, consequently, the power generation.
Other Significant Features¶
- Azimuth (12.48%) and Shortwave Radiation Backwards SFC (12.33%): Both are related to the position and intensity of sunlight, emphasizing the importance of solar irradiance and panel orientation in solar power generation.
- Relative Humidity and Total Cloud Cover: These atmospheric conditions significantly affect solar panel efficiency by influencing the amount of solar radiation that reaches the panels.
Lower Impact Features¶
- Weather-related features like Mean Sea Level Pressure, Low Cloud Cover, and Temperature have moderate importance, indirectly affecting solar power generation by influencing the local climate and weather patterns.
- Wind-related Features: Wind speed and direction at different levels have lower importance, indicating that they are less critical for power generation compared to solar irradiance and panel orientation.
Least Important Features¶
- Snowfall Amount SFC: Has the least importance, possibly due to its minimal impact on solar power generation in most scenarios or due to the rarity of snowfall events in the dataset.
Summary¶
- The analysis highlights the critical role of solar irradiance (zenith, angle of incidence, azimuth, shortwave radiation) in solar power generation.
- Atmospheric conditions (humidity, cloud cover) play a significant role but are secondary to direct solar parameters.
- The relatively lower importance of temperature and wind-related features suggests that these factors are less critical in predicting solar power generation compared to direct sunlight-related features.
- These insights can guide more targeted data collection, feature engineering, and model refinement efforts in the field of solar energy forecasting.
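As a follow-up for model refinement, impurity-based importances from `feature_importances_` can be cross-checked with permutation importance, which measures the drop in held-out R² when each feature is shuffled. The sketch below uses toy data; applying it to the notebook's `no_dim_reduce_rf_model` and test split would be the analogous step.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Toy stand-in: 4 features, only 2 of which carry signal
X_toy, y_toy = make_regression(n_samples=300, n_features=4, n_informative=2, random_state=43)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=43)

rf = RandomForestRegressor(n_estimators=100, random_state=43).fit(X_tr, y_tr)

# Shuffle each feature 10 times and record the mean drop in test-set score
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=43)
print(result.importances_mean)  # higher = larger score drop when that feature is shuffled
```

Unlike impurity-based importances, this measure is computed on held-out data, so it is less biased toward high-cardinality features and reflects what the model actually relies on at prediction time.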